
Implement VAES AVX and AVX512 backends for aes #482


Merged: 1 commit merged on Jun 11, 2025

Conversation

@silvanshade (Contributor) commented May 29, 2025

Here is an updated implementation of VAES256 and VAES512 support for aes.

Benchmarks are from a Ryzen 9950X3D.

VAES512

RUSTFLAGS="-Ctarget-cpu=native" cargo +nightly bench

test aes128_decrypt_block  ... bench:         948.52 ns/iter (+/- 9.78) = 17282 MB/s
test aes128_decrypt_blocks ... bench:         249.35 ns/iter (+/- 0.38) = 65799 MB/s
test aes128_encrypt_block  ... bench:         951.15 ns/iter (+/- 7.81) = 17228 MB/s
test aes128_encrypt_blocks ... bench:         249.41 ns/iter (+/- 0.50) = 65799 MB/s
test aes128_new            ... bench:           8.99 ns/iter (+/- 0.01)
test aes192_decrypt_block  ... bench:       1,176.52 ns/iter (+/- 18.27) = 13931 MB/s
test aes192_decrypt_blocks ... bench:         304.76 ns/iter (+/- 20.07) = 53894 MB/s
test aes192_encrypt_block  ... bench:       1,206.06 ns/iter (+/- 39.16) = 13585 MB/s
test aes192_encrypt_blocks ... bench:         297.87 ns/iter (+/- 26.79) = 55164 MB/s
test aes192_new            ... bench:          11.45 ns/iter (+/- 0.60)
test aes256_decrypt_block  ... bench:       1,468.79 ns/iter (+/- 13.34) = 11160 MB/s
test aes256_decrypt_blocks ... bench:         344.53 ns/iter (+/- 3.02) = 47627 MB/s
test aes256_encrypt_block  ... bench:       1,472.97 ns/iter (+/- 24.24) = 11130 MB/s
test aes256_encrypt_blocks ... bench:         346.99 ns/iter (+/- 2.39) = 47352 MB/s
test aes256_new            ... bench:          12.31 ns/iter (+/- 0.02)

VAES256

RUSTFLAGS="-Ctarget-cpu=native --cfg aes_avx512_disable" cargo +nightly bench

test aes128_decrypt_block  ... bench:         916.29 ns/iter (+/- 12.56) = 17886 MB/s
test aes128_decrypt_blocks ... bench:         443.50 ns/iter (+/- 4.39) = 36984 MB/s
test aes128_encrypt_block  ... bench:         918.70 ns/iter (+/- 17.38) = 17847 MB/s
test aes128_encrypt_blocks ... bench:         451.14 ns/iter (+/- 1.01) = 36328 MB/s
test aes128_new            ... bench:           8.82 ns/iter (+/- 0.01)
test aes192_decrypt_block  ... bench:       1,141.80 ns/iter (+/- 8.07) = 14359 MB/s
test aes192_decrypt_blocks ... bench:         542.46 ns/iter (+/- 6.65) = 30228 MB/s
test aes192_encrypt_block  ... bench:       1,143.93 ns/iter (+/- 11.54) = 14334 MB/s
test aes192_encrypt_blocks ... bench:         540.67 ns/iter (+/- 2.86) = 30340 MB/s
test aes192_new            ... bench:          10.83 ns/iter (+/- 0.01)
test aes256_decrypt_block  ... bench:       1,434.76 ns/iter (+/- 20.45) = 11425 MB/s
test aes256_decrypt_blocks ... bench:         629.03 ns/iter (+/- 0.54) = 26047 MB/s
test aes256_encrypt_block  ... bench:       1,432.48 ns/iter (+/- 21.32) = 11441 MB/s
test aes256_encrypt_blocks ... bench:         629.17 ns/iter (+/- 3.79) = 26047 MB/s
test aes256_new            ... bench:          11.90 ns/iter (+/- 0.54)

AES-NI

RUSTFLAGS="-Ctarget-cpu=native --cfg aes_avx512_disable --cfg aes_avx256_disable" cargo +nightly bench

test aes128_decrypt_block  ... bench:         949.50 ns/iter (+/- 14.32) = 17264 MB/s
test aes128_decrypt_blocks ... bench:         937.90 ns/iter (+/- 8.32) = 17485 MB/s
test aes128_encrypt_block  ... bench:         952.43 ns/iter (+/- 12.84) = 17210 MB/s
test aes128_encrypt_blocks ... bench:         938.51 ns/iter (+/- 11.09) = 17466 MB/s
test aes128_new            ... bench:           8.96 ns/iter (+/- 0.02)
test aes192_decrypt_block  ... bench:       1,174.35 ns/iter (+/- 11.97) = 13955 MB/s
test aes192_decrypt_blocks ... bench:       1,139.37 ns/iter (+/- 3.99) = 14384 MB/s
test aes192_encrypt_block  ... bench:       1,176.52 ns/iter (+/- 9.51) = 13931 MB/s
test aes192_encrypt_blocks ... bench:       1,139.30 ns/iter (+/- 4.89) = 14384 MB/s
test aes192_new            ... bench:          11.08 ns/iter (+/- 0.01)
test aes256_decrypt_block  ... bench:       1,470.33 ns/iter (+/- 22.56) = 11145 MB/s
test aes256_decrypt_blocks ... bench:       1,327.11 ns/iter (+/- 15.53) = 12346 MB/s
test aes256_encrypt_block  ... bench:       1,432.09 ns/iter (+/- 10.22) = 11441 MB/s
test aes256_encrypt_blocks ... bench:       1,288.63 ns/iter (+/- 11.47) = 12720 MB/s
test aes256_new            ... bench:          12.25 ns/iter (+/- 0.31)

@silvanshade silvanshade mentioned this pull request May 29, 2025
@tarcieri tarcieri requested a review from newpavlov May 29, 2025 18:01
@silvanshade force-pushed the vaes branch 4 times, most recently from ae7ec9c to 93e30c4 on May 31, 2025 01:06
@newpavlov (Member) commented Jun 2, 2025

@tarcieri
Maybe we should make the VAES support a nightly-only feature enabled via configuration flags? With the current implementation, round keys and blocks cannot be kept in registers and have to be spilled to the stack. I have concerns that supporting VAES by default like this would result in inefficient code when combined with other crates like ctr and ghash.

@tarcieri (Member) commented Jun 2, 2025

I think making VAES opt-in via cfg is fine for getting this PR merged. Note, however, that AVX-512 support just passed FCP, and avx512_target_feature will be stable in Rust 1.89, I believe, so nightly will soon be unnecessary.

@newpavlov (Member):

In that case, I think we should release aes v0.9.0 without VAES support and shortly after release aes v0.9.1 with intrinsics-based VAES support. This would allow users who don't use the latest stable to use v0.9.0 thanks to the MSRV-aware resolver.
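For context, the MSRV-aware resolver keys off Cargo's `rust-version` field. A sketch of what a hypothetical v0.9.1 manifest might declare (the version numbers are illustrative, not decided in this thread):

```toml
[package]
name = "aes"
version = "0.9.1"
# Hypothetical MSRV, assuming avx512_target_feature stabilizes in Rust 1.89:
rust-version = "1.89"
```

Since Cargo 1.84, the resolver by default falls back to the newest release whose `rust-version` is compatible with the active toolchain, so users on an older stable would keep resolving to v0.9.0.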

@tarcieri (Member) commented Jun 2, 2025

I'm not sure how long it will take to stabilize all the relevant intrinsics. If they're through FCP, hopefully soon? If they take a while, we can polyfill them with asm!

@newpavlov (Member) commented Jun 2, 2025

The intrinsics (e.g. _mm512_aesenc_epi128) link to the stdarch_x86_avx512 issue, so I assumed the intrinsics will be stabilized as part of this FCP. The relevant target features have already been stabilized: rust-lang/rust#138940

aes/Cargo.toml (outdated)

```diff
@@ -31,7 +31,7 @@ hazmat = [] # Expose cryptographically hazardous APIs

 [lints.rust.unexpected_cfgs]
 level = "warn"
-check-cfg = ["cfg(aes_compact)", "cfg(aes_force_soft)"]
+check-cfg = ["cfg(aes_compact)", "cfg(aes_force_soft)", "cfg(avx256_disable)", "cfg(avx512_disable)"]
```
Member:

Perhaps namespace these configs under aes_* like the others?

Suggested change:

```diff
-check-cfg = ["cfg(aes_compact)", "cfg(aes_force_soft)", "cfg(avx256_disable)", "cfg(avx512_disable)"]
+check-cfg = ["cfg(aes_compact)", "cfg(aes_force_soft)", "cfg(aes_avx256_disable)", "cfg(aes_avx512_disable)"]
```

Member:

Ideally, this would be handled with "negative" target features, but alas. :/

@newpavlov (Member) commented Jun 2, 2025

It may also be worth naming it just aes_disable_vaes. It does not look like we have different target features for the VAES256 and VAES512 intrinsics, so it may be worth just removing the vaes256 backend.

UPD: Ah, the VAES512 intrinsics are also gated on avx512f. Though I wonder if there are CPUs which support VAES256 but not VAES512.

Member:

I looked into it and it looks like some CPU families support VAES256, but not VAES512 (e.g. Zen 3).

@silvanshade (Contributor, Author) commented Jun 2, 2025

> It also may be worth to name it just aes_disable_vaes. It does not look like we have different target features for VAES256 and VAES512 intrinsics, so it may be worth to just remove the vaes256 backend.

Intel started disabling AVX-512 on Alder Lake CPUs, if I remember correctly, and had at one point stated in their roadmap that they intended not to include it on newer consumer-oriented CPUs. That was another reason I included VAES256.

It sounds like maybe Intel's plans have changed though: https://www.phoronix.com/news/Intel-AVX10-Drops-256-Bit

But I also use the VAES256 backend as a fallback for encrypting tail blocks:

```rust
#[inline]
fn encrypt_tail_blocks(&self, blocks: InOutBuf<'_, '_, Block>) {
    let backend = self;
    let mut rem = blocks.len();
    let (mut iptr, mut optr) = blocks.into_raw();
    let backend = &$name_backend::Vaes256::from(backend);
    if backend.features.has_vaes256() {
        while rem >= backend.par_blocks() {
            let blocks = unsafe { InOut::from_raw(iptr.cast(), optr.cast()) };
            backend.encrypt_par_blocks(blocks);
            rem -= backend.par_blocks();
            iptr = unsafe { iptr.add(backend.par_blocks()) };
            optr = unsafe { optr.add(backend.par_blocks()) };
        }
    }
    let backend = &$name_backend::Ni::from(backend);
    while rem >= backend.par_blocks() {
        let blocks = unsafe { InOut::from_raw(iptr.cast(), optr.cast()) };
        backend.encrypt_par_blocks(blocks);
        rem -= backend.par_blocks();
        iptr = unsafe { iptr.add(backend.par_blocks()) };
        optr = unsafe { optr.add(backend.par_blocks()) };
    }
    while rem > 0 {
        let block = unsafe { InOut::from_raw(iptr, optr) };
        backend.encrypt_block(block);
        rem -= 1;
        iptr = unsafe { iptr.add(1) };
        optr = unsafe { optr.add(1) };
    }
}
```

@silvanshade (Contributor, Author):

I renamed the config flags with the aes_ prefix as suggested.

@silvanshade (Contributor, Author):

> In that case, I think we should release aes v0.9.0 without VAES support and shortly after release aes v0.9.1 with intrinsics-based VAES support. This would allow users who don't use the latest stable to use v0.9.0 thanks to the MSRV-aware resolver.

Just so we understand each other: I have no intention of refactoring this code further, especially not a full rewrite using intrinsics, without first seeing:

  1. A full review of what is currently in this PR.
  2. A clear outline of what conditions need to be met for the PR to be merged.

@tarcieri (Member) left a comment:

I think we should just merge this and address the remaining problems in follow-up PRs.

@silvanshade (Contributor, Author):

> I think we should just merge this and address the remaining problems in follow-up PRs.

I think that would be a good approach.

I'm willing to help maintain and continue to improve the code if it's going to be used.

@tarcieri (Member) commented Jun 10, 2025

@newpavlov does that sound good to you?

Edit: since it's nightly-only, that effectively means it's unstable anyway. We can call it out as such until we can stabilize it, but I think retaining the cfg gating would still be good until we're actually sure about the performance implications of having it on by default.

@newpavlov (Member):

Personally, I would prefer to have a proper intrinsics-based implementation released in v0.9.1, but if there is a desire to have AVX-512 support in v0.9.0, I am fine with merging this. In the latter case, though, I believe it should be an experimental nightly-only backend gated behind a cfg flag. Plus, I don't think it makes sense to have the OnceCell caching of broadcasted keys; I think it would only obstruct compiler optimizations. But since @silvanshade did not give us rights to edit this PR, we can change it in later PRs.

@newpavlov newpavlov merged commit ad83428 into RustCrypto:master Jun 11, 2025
163 checks passed
@silvanshade (Contributor, Author):

First, thanks for finally merging the PR.

> Plus, I don't think it makes sense to have the OnceCell caching of broadcasted keys, I think it only would obstruct compiler optimizations.

That design has been in place since the original PR, which is a long time ago now. You could have reviewed the design, performed your own benchmarks, and proposed a better alternative.

I would have welcomed feedback on a better design.

Unless the cipher API is redesigned to separate the backends, I don't see how you avoid a compromise.

If you always store the broadcast keys, you get a very measurable slowdown for the non-VAES case which I showed in the previous thread. If you rebroadcast for every chunk of blocks for VAES, you also get a slowdown from unnecessary work.

Caching the keys on demand seems to alleviate those performance hits in the benchmarks.

> But since @silvanshade did not give us rights to edit this PR, we can change it in later PRs.

You could have asked me to enable the option for maintainer commits if you had changes you wanted to make.

I would have expected we would have discussed those in a review first though.

@tarcieri (Member):

Now that we have the core design merged, we can experiment with those details without them blocking the PR.
